Scalable Recommendation with Poisson Factorization
We develop a Bayesian Poisson matrix factorization model for forming
recommendations from sparse user behavior data. These data are large user/item
matrices where each user has provided feedback on only a small subset of items,
either explicitly (e.g., through star ratings) or implicitly (e.g., through
views or purchases). In contrast to traditional matrix factorization
approaches, Poisson factorization implicitly models each user's limited
attention to consume items. Moreover, because of the mathematical form of the
Poisson likelihood, the model needs only to explicitly consider the observed
entries in the matrix, leading to both scalable computation and good predictive
performance. We develop a variational inference algorithm for approximate
posterior inference that scales up to massive data sets. This is an efficient
algorithm that iterates over the observed entries and adjusts an approximate
posterior over the user/item representations. We apply our method to large
real-world user data containing users rating movies, users listening to songs,
and users reading scientific papers. In all these settings, Bayesian Poisson
factorization outperforms state-of-the-art matrix factorization methods.
Ask the GRU: Multi-Task Learning for Deep Text Recommendations
In a variety of application domains the content to be recommended to users is
associated with text. This includes research papers, movies with associated
plot summaries, news articles, blog posts, etc. Recommendation approaches based
on latent factor models can be extended naturally to leverage text by employing
an explicit mapping from text to factors. This enables recommendations for new,
unseen content, and may generalize better, since the factors for all items are
produced by a compactly-parametrized model. Previous work has used topic models
or averages of word embeddings for this mapping. In this paper we present a
method leveraging deep recurrent neural networks to encode the text sequence
into a latent vector, specifically gated recurrent units (GRUs) trained
end-to-end on the collaborative filtering task. For the task of scientific
paper recommendation, this yields models with significantly higher accuracy. In
cold-start scenarios, we beat the previous state-of-the-art methods, all of which
ignore word order. Performance is further improved by multi-task learning,
where the text encoder network is trained for a combination of content
recommendation and item metadata prediction. This regularizes the collaborative
filtering model, ameliorating the problem of sparsity of the observed rating
matrix.
Comment: 8 pages.
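As a sketch of the core idea (a from-scratch GRU cell with made-up dimensions, not the paper's trained model), the text encoder folds a word sequence into a single latent vector that plays the role of the item's factors:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUTextEncoder:
    """Maps a sequence of word ids to a k-dim item factor vector."""

    def __init__(self, vocab_size, k, seed=0):
        rng = np.random.default_rng(seed)
        init = lambda *shape: rng.normal(0.0, 0.1, shape)
        self.emb = init(vocab_size, k)             # word embeddings
        self.Wz, self.Uz = init(k, k), init(k, k)  # update gate
        self.Wr, self.Ur = init(k, k), init(k, k)  # reset gate
        self.Wh, self.Uh = init(k, k), init(k, k)  # candidate state

    def encode(self, word_ids):
        h = np.zeros(self.emb.shape[1])
        for w in word_ids:
            x = self.emb[w]
            z = sigmoid(self.Wz @ x + self.Uz @ h)   # how much to update
            r = sigmoid(self.Wr @ x + self.Ur @ h)   # how much history to keep
            h_cand = np.tanh(self.Wh @ x + self.Uh @ (r * h))
            h = (1.0 - z) * h + z * h_cand
        return h                                     # item factors f(text)

# A predicted preference is the dot product of user factors with f(text),
# so new, unseen items can be scored from their text alone.
encoder = GRUTextEncoder(vocab_size=1000, k=16)
user_factors = np.random.default_rng(1).normal(size=16)
score = user_factors @ encoder.encode([5, 17, 101, 4])
```

Unlike bag-of-words averages, the recurrence makes the encoding order-sensitive, which is the property the cold-start comparison above hinges on.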
On Sampling Strategies for Neural Network-based Collaborative Filtering
Recent advances in neural networks have inspired people to design hybrid
recommendation algorithms that can incorporate both (1) user-item interaction
information and (2) content information including image, audio, and text.
Despite their promising results, neural network-based recommendation algorithms
incur substantial computational costs, making them challenging to scale and
improve upon. In this paper, we propose a general neural network-based recommendation
framework, which subsumes several existing state-of-the-art recommendation
algorithms, and address the efficiency issue by investigating sampling
strategies in the stochastic gradient descent training for the framework. We
tackle this issue by first establishing a connection between the loss functions
and the user-item interaction bipartite graph, where the loss function terms
are defined on links while major computation burdens are located at nodes. We
call this type of loss functions "graph-based" loss functions, for which varied
mini-batch sampling strategies can have different computational costs. Based on
this insight, three novel sampling strategies are proposed, which can
significantly improve the training efficiency of the proposed framework
(with substantial speedups in our experiments), as well as improve the
recommendation performance. Theoretical analysis is also provided for both the
computational cost and the convergence. We believe the study of sampling
strategies has further implications for general graph-based loss functions, and
would also enable more research under the neural network-based recommendation
framework.
Comment: This is a longer version (with supplementary attached) of the KDD'17 paper.
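To make the "loss terms on links, computation at nodes" distinction concrete, here is a toy cost model (hypothetical data, batch sizes, and function names, not the paper's exact strategies): the expensive step is computing a node's representation, so grouping a mini-batch's links by node lets one computation serve many loss terms:

```python
import random
from collections import defaultdict

# Toy bipartite interaction graph: 50 users, each linked to 10 of 200 items.
random.seed(0)
links = [(u, i) for u in range(50) for i in random.sample(range(200), 10)]

def iid_link_batches(links, batch_size):
    """Uniform link sampling: every link may need fresh user and item encodings."""
    random.shuffle(links)
    for s in range(0, len(links), batch_size):
        yield links[s:s + batch_size]

def node_grouped_batches(links, users_per_batch):
    """Group links by user so one user encoding is shared across its links."""
    by_user = defaultdict(list)
    for u, i in links:
        by_user[u].append((u, i))
    users = list(by_user)
    random.shuffle(users)
    for s in range(0, len(users), users_per_batch):
        yield [l for u in users[s:s + users_per_batch] for l in by_user[u]]

def encoder_calls(batches):
    """Cost proxy: number of distinct node-representation computations."""
    calls = 0
    for batch in batches:
        calls += len({("u", u) for u, _ in batch})
        calls += len({("i", i) for _, i in batch})
    return calls
```

On this toy graph the grouped scheme computes each sampled user's representation exactly once per batch, while uniform link sampling recomputes node representations for nearly every link in the batch.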
Characterizing and predicting repeat food consumption behavior for just-in-time interventions
National Research Foundation (NRF) Singapore under its International Research Centres in Singapore Funding Initiative.
Nutrigenomics: future for sustenance
Nutrigenomics deals with the effect of foods and food constituents on gene expression. It is a new concept in disease prevention and cure. Nutrigenomics describes how nutrients influence gene expression in our bodies, whereas nutrigenetics refers to how our bodies respond to nutrients. Various bioactive food components can alter gene expression mechanisms. However, our current knowledge is so limited that such information serves mostly to stimulate the imagination. If science can establish more precise facts, they will have vast applications in medicine.
Privatization and business groups: Evidence from the Chicago Boys in Chile
Business groups are the predominant organizational structure in modern Chile. This article tests the long-standing hypothesis that the privatization reform implemented by the “Chicago Boys” during the Pinochet regime facilitated the creation of new groups and hence the renovation of the country’s elites. Using new data, we find that firms sold during this privatization later became part of new business groups, a process aided by an economic crisis that debilitated traditional elites. Moreover, some firms were bought by Pinochet’s allies and were later used as providers of capital within groups. We conclude that privatizations can empower outsiders to replace business elites.
Impact of COVID-19 on cardiovascular testing in the United States versus the rest of the world
Objectives: This study sought to quantify and compare the decline in volumes of cardiovascular procedures between the United States and non-US institutions during the early phase of the coronavirus disease-2019 (COVID-19) pandemic.
Background: The COVID-19 pandemic has disrupted the care of many non-COVID-19 illnesses. Reductions in diagnostic cardiovascular testing around the world have led to concerns over the implications of reduced testing for cardiovascular disease (CVD) morbidity and mortality.
Methods: Data were submitted to the INCAPS-COVID (International Atomic Energy Agency Non-Invasive Cardiology Protocols Study of COVID-19), a multinational registry comprising 909 institutions in 108 countries (including 155 facilities in 40 U.S. states), assessing the impact of the COVID-19 pandemic on volumes of diagnostic cardiovascular procedures. Data were obtained for April 2020 and compared with volumes of baseline procedures from March 2019. We compared laboratory characteristics, practices, and procedure volumes between U.S. and non-U.S. facilities and between U.S. geographic regions and identified factors associated with volume reduction in the United States.
Results: Reductions in the volumes of procedures in the United States were similar to those in non-U.S. facilities (68% vs. 63%, respectively; p = 0.237), although U.S. facilities reported greater reductions in invasive coronary angiography (69% vs. 53%, respectively; p < 0.001). Significantly more U.S. facilities reported increased use of telehealth and patient screening measures than non-U.S. facilities, such as temperature checks, symptom screenings, and COVID-19 testing. Reductions in volumes of procedures differed between U.S. regions, with larger declines observed in the Northeast (76%) and Midwest (74%) than in the South (62%) and West (44%). Prevalence of COVID-19, staff redeployments, outpatient centers, and urban centers were associated with greater reductions in volume in U.S. facilities in a multivariable analysis.
Conclusions: We observed marked reductions in U.S. cardiovascular testing in the early phase of the pandemic and significant variability between U.S. regions. The association between reductions of volumes and COVID-19 prevalence in the United States highlighted the need for proactive efforts to maintain access to cardiovascular testing in areas most affected by outbreaks of COVID-19 infection.
Scalable inference of discrete data: user behavior, networks and genetic variation
Recent years have seen explosive growth in data, models and
computation. Massive data sets and sophisticated probabilistic models
are increasingly used in the fields of high-energy physics, biology,
genetics and in personalization applications; however, many
statistical algorithms remain inefficient, impeding scientific
progress.
In this thesis, we present several efficient statistical algorithms
for learning from massive discrete data sets. We focus on discrete
data because complex and structured activities such as chromosome
folding in three dimensions, human genetic variation, social
network interactions and product ratings are often encoded as simple
matrices of discrete numerical observations. Our algorithms derive
from a Bayesian perspective and lie in the framework of directed
graphical models and mean-field variational inference. Situated in
this framework, we gain computational and statistical efficiency
through modeling insights and through subsampling informative data
during inference.
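The subsampling idea can be illustrated with a deliberately simple stand-in problem (a single Poisson rate, not the thesis's models): rescaling a mini-batch gradient by N over the batch size keeps it unbiased for the full-data gradient, so each update touches only a small subsample of a massive data set.

```python
import numpy as np

rng = np.random.default_rng(0)

N = 100_000
data = rng.poisson(4.0, size=N)   # "massive" discrete observations

log_rate = 0.0                    # optimize in log space so the rate stays positive
for t in range(1, 2001):
    batch = data[rng.integers(0, N, size=64)]  # uniform subsample
    rate = np.exp(log_rate)
    # d/d(log rate) of the Poisson log-likelihood on the batch, rescaled by
    # N / batch_size so its expectation equals the full-data gradient.
    grad = (batch.sum() - batch.size * rate) * (N / batch.size)
    log_rate += (1e-6 / np.sqrt(t)) * grad     # decaying Robbins-Monro step size
```

Each iteration costs 64 observations rather than 100,000, yet the noisy updates still climb the full-data likelihood on average; the thesis's variational algorithms exploit the same unbiasedness, with structured models in place of this scalar example.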
We begin with additive Poisson factorization models for recommending
items to users based on user consumption or ratings. These models
provide sparse latent representations of users and items, and capture
the long-tailed distributions of user consumption. We use them as
building blocks for article recommendation models by sharing latent
spaces across readership and article text. We demonstrate that our
algorithms scale to massive data sets, are easy to implement and
provide competitive user recommendations. Then, we develop a
Bayesian nonparametric model in which the latent representations of
users and items grow to accommodate new data.
In the second part of the thesis, we develop novel algorithms for
discovering overlapping communities in large networks. These
algorithms interleave non-uniform subsampling of the network with
model estimation. Our network models capture the basic ways in which
nodes connect to each other, through similarity and popularity, using
mixed-membership representations and a generalized linear model
formulation.
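A hypothetical minimal version of such a link model (illustrative names and scales, not the thesis's estimator): each node mixes over communities, and the similarity of two nodes' mixtures plus their popularities set the link log-odds.

```python
import numpy as np

rng = np.random.default_rng(0)
n_nodes, k = 30, 4

memberships = rng.dirichlet(np.ones(k), size=n_nodes)  # mixed memberships; rows sum to 1
popularity = rng.normal(0.0, 1.0, size=n_nodes)        # per-node degree propensity

def link_prob(u, v, weight=4.0, bias=-3.0):
    """Probability of a link: similarity plus popularity through a logit link."""
    similarity = memberships[u] @ memberships[v]       # shared-community mass
    logit = weight * similarity + popularity[u] + popularity[v] + bias
    return 1.0 / (1.0 + np.exp(-logit))
```

The two terms separate the "who connects to whom" question (overlapping community structure) from the "who connects a lot" question (popularity), which is the decomposition the paragraph above describes.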
Finally, we present the TeraStructure algorithm to fit Bayesian models
of genetic variation in human populations on tera-sample-sized data
sets (10^{12} observed genotypes, e.g., 1M individuals at 1M SNPs).
On real genomic data collected from thousands of individuals,
TeraStructure is faster than existing methods and recovers the latent
population structure with equal accuracy. On genomic data simulated at
the tera-sample-size scales, TeraStructure is highly accurate and is
the only method that can complete its analysis.